- Overview of probability and statistics
- Chi-squared test: goodness of fit, independence
- Student’s t-test: one sample, paired samples, independent two-sample
- One-way ANOVA test
- Normality test
- Homoscedasticity test
Inferential statistics
Marc Comas-Cufí
Given a sample \(X = \{x_1, \dots, x_N\}\), we are interested in analysing a model \(\text{Model}(\theta)\), \(\theta \in \Theta\), under the assumption that \(x_i\)’s are independent and identically distributed, i.e. \(x_i \sim \text{Model}(\theta)\).
\[ \text{P}(\theta_{-} \leq \theta \leq \theta_{+}) = 1-\alpha. \]
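The coverage statement above can be checked empirically. The following is a minimal simulation sketch (sample size, true mean and standard deviation are arbitrary choices, not from the text): we repeatedly draw samples, build a 95% confidence interval for the mean with `t.test()`, and count how often the interval contains the true parameter.

```r
# Coverage check for a 95% confidence interval of a normal mean
# (illustrative sketch; n = 30, mean = 10, sd = 2 are arbitrary).
set.seed(42)
alpha <- 0.05
covered <- replicate(2000, {
  x <- rnorm(30, mean = 10, sd = 2)                 # sample from Model(theta)
  ci <- t.test(x, conf.level = 1 - alpha)$conf.int  # interval (theta-, theta+)
  ci[1] <= 10 && 10 <= ci[2]                        # does it contain theta?
})
mean(covered)   # proportion of intervals covering theta, close to 1 - alpha
```

The observed coverage fluctuates around \(1-\alpha = 0.95\), which is exactly what the probability statement promises.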
Let’s assume \(X \sim Bin(n=1, \pi)\), for example:
Suppose we have:
X2 = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0)

If \(X \sim Bin(n=1, \pi)\) and \(N\) is large enough,
\[ \bar{X} \sim N(\mathbb{E}[X], \sqrt{\text{var}[X]/N}) \implies \pi = \mathbb{E}[X] \sim N(\bar{X}, \sqrt{\pi (1-\pi)/N}). \]
Plugging in the estimate \(\bar{X}\) for \(\pi\) in the standard error:
\[ \pi = \mathbb{E}[X] \sim N(\bar{X}, \sqrt{\bar{X} (1-\bar{X})/N}) \]
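Applying the formula above is a one-liner in R. This is a sketch: `x_bern` below is a simulated stand-in for a Bernoulli sample like `X2` (the seed and true probability are arbitrary).

```r
# Normal-approximation 95% CI for pi from a Bernoulli sample
# (sketch; x_bern is a simulated stand-in for a sample like X2 above).
set.seed(1)
x_bern <- rbinom(100, size = 1, prob = 0.25)
N <- length(x_bern)
p_hat <- mean(x_bern)                     # point estimate of pi
alpha <- 0.05
z <- qnorm(1 - alpha / 2)                 # normal quantile, ~1.96
ci <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / N)
p_hat
ci
```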
Given a numeric sample:
X = c(11.2, 10.28, 10.24, 10.69, 8.34, 11.06, 9.53, 10.6, 8.78, 9.16,
9.52, 10.63, 10.18, 9.06, 9.87, 11.18, 9.91, 9.26, 9.25, 10.35,
9.52, 10.61, 9.83, 10.14, 9.42, 8.82, 8.84, 9.65, 11.09, 9.49,
10.03, 10.59, 10.64, 10.74, 10.59, 9.04, 8.52, 9.83, 9.62, 9.45,
10.83, 9.65, 10.37, 10.38, 9.66, 10.45, 9.99, 11.11, 10.47, 9.41)

diamonds activity

Explain the prices of the diamonds using information from the other variables. Which variables are the most relevant to explain the price?
#> # A tibble: 53,940 × 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#> 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#> 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#> 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
#> 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
#> 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
#> 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
#> # … with 53,930 more rows
diamonds is a well-known dataset, analyzed on many websites across the internet; you can borrow ideas from there. Use Quarto or Rmarkdown to generate your document.

We say that a result has statistical significance when it is very unlikely to have occurred under the null hypothesis. Before a hypothesis test is performed, a significance level \(\alpha\) is set, usually \(\alpha=0.05\).
binom.test(sum(X_bin), n = length(X_bin), p = 0.4)
#>
#> Exact binomial test
#>
#> data: sum(X_bin) and length(X_bin)
#> number of successes = 4, number of trials = 20, p-value = 0.07198
#> alternative hypothesis: true probability of success is not equal to 0.4
#> 95 percent confidence interval:
#> 0.057334 0.436614
#> sample estimates:
#> probability of success
#> 0.2

t.test(X ~ G, data = data)
#>
#> Welch Two Sample t-test
#>
#> data: X by G
#> t = -1.638, df = 16.394, p-value = 0.1205
#> alternative hypothesis: true difference in means between group A and group B is not equal to 0
#> 95 percent confidence interval:
#> -1.8562806 0.2362806
#> sample estimates:
#> mean in group A mean in group B
#> 79.70 80.51

Depending on the nature of the two variables, other tests exist to test for their independence.
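For instance, with two numeric variables a correlation test is one option. The sketch below uses simulated data (variable names and the dependence structure are illustrative, not from the text).

```r
# For two numeric variables, Pearson's correlation test is one option
# (sketch on simulated data; H0 is that the correlation is zero).
set.seed(7)
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.5)      # y depends on x by construction
res <- cor.test(x, y)
res
# For two categorical variables, chisq.test() or fisher.test() on the
# contingency table, table(a, b), play the analogous role.
```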
Pearson’s chi-squared test
\[ \chi^2 = \sum_{i=1}^k \frac{(O_i-E_i)^2}{E_i} \sim {\chi^2}_{k-1} \]
where \(O_i\) is the number of times \(i\) was observed and \(E_i = n \times \pi_i\).
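In R, the statistic above is computed by `chisq.test()`. A minimal goodness-of-fit sketch (the die counts below are made up for illustration):

```r
# Goodness-of-fit sketch: are the six faces of a die equally likely?
# (the observed counts are made up for illustration)
O <- c(18, 22, 16, 25, 20, 19)      # observed counts O_i, n = 120
res <- chisq.test(O, p = rep(1/6, 6))  # E_i = n * pi_i with pi_i = 1/6
res
```

Here the statistic is \(\chi^2 = \sum_i (O_i - 20)^2/20 = 2.5\) on \(k-1 = 5\) degrees of freedom, so there is no evidence against a fair die.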
Shapiro-Wilk test
\[ W = \frac{(\sum_{i=1}^n a_i x_{[i]})^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \sim f_{W} \]
where the \(a_i\)'s are certain constants, \(x_{[i]}\) is the \(i\)-th smallest observation in \(X\) and \(f_W\) is the probability distribution of the r.v. \(W\).
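The test is available as `shapiro.test()`. A quick sketch on simulated data (sample sizes and distributions are arbitrary choices):

```r
# Shapiro-Wilk normality test (sketch on simulated data).
set.seed(3)
x_norm <- rnorm(50, mean = 10, sd = 1)   # normal data
shapiro.test(x_norm)                     # large p-value expected
x_skew <- rexp(50)                       # clearly non-normal (skewed) data
shapiro.test(x_skew)                     # small p-value expected
```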
For \(X\) categorical and \(Y\) numerical.
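With a categorical \(X\) and a numerical \(Y\), one-way ANOVA compares the mean of \(Y\) across the levels of \(X\). A sketch on simulated data (group labels and means are illustrative, not from the text):

```r
# One-way ANOVA sketch: does the mean of Y differ across the levels of X?
# (simulated data; group C is built with a higher mean)
set.seed(11)
d <- data.frame(
  X = rep(c("A", "B", "C"), each = 20),
  Y = c(rnorm(20, 5), rnorm(20, 5), rnorm(20, 7))
)
res <- oneway.test(Y ~ X, data = d)   # Welch-type one-way ANOVA
res
# bartlett.test(Y ~ X, data = d) would test homoscedasticity of Y across groups
```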
broom: Convert Statistical Objects into Tidy Tibbles

binom.test(c(682, 243), p = 3/4)
#>
#> Exact binomial test
#>
#> data: c(682, 243)
#> number of successes = 682, number of trials = 925, p-value = 0.3825
#> alternative hypothesis: true probability of success is not equal to 0.75
#> 95 percent confidence interval:
#> 0.7076683 0.7654066
#> sample estimates:
#> probability of success
#> 0.7372973

With broom:
library(broom)
tidy(binom.test(c(682, 243), p = 3/4))
#> # A tibble: 1 × 8
#> estimate statistic p.value parameter conf.low conf.high method alter…¹
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 0.737 682 0.382 925 0.708 0.765 Exact binomia… two.si…
#> # … with abbreviated variable name ¹alternative…
km_ = kmeans(iris[1:4], 2, nstart = 100)
tidy(km_)
#> # A tibble: 2 × 7
#> Sepal.Length Sepal.Width Petal.Length Petal.Width size withinss cluster
#> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <fct>
#> 1 6.30 2.89 4.96 1.70 97 124. 1
#> 2 5.01 3.37 1.56 0.291 53 28.6 2
glance(km_)
#> # A tibble: 1 × 4
#> totss tot.withinss betweenss iter
#> <dbl> <dbl> <dbl> <int>
#> 1 681. 152. 529. 1
# augment(km_, iris[1:4])
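The commented `augment()` call completes the broom trio: `tidy()` summarises per component, `glance()` per model, and `augment()` per observation. A short sketch of what it returns (assuming the broom package is installed):

```r
# broom::augment() attaches per-observation results (here, the cluster
# assignment of each row) back onto the original data.
library(broom)
km_ <- kmeans(iris[1:4], 2, nstart = 100)
aug <- augment(km_, iris[1:4])
head(aug)            # original columns plus a .cluster factor column
table(aug$.cluster)  # cluster sizes, matching tidy(km_)$size
```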